Aidan Slingsby, City University London, a.slingsby@soi.city.ac.uk [PRIMARY contact]
Jo Wood, City University London, jwo@soi.city.ac.uk
Jason Dykes, City University London, jad7@soi.city.ac.uk
sequenceView was built in a couple of days using Processing - a set of Java libraries for rapidly designing and producing graphical sketches. The giCentre's long experience of using Processing in this way makes such development rapid.
The screen is split into three parts (vertically):
Bases are identified by hue (optionally labelled).
Interactions
The only automated function is computing the longest 'common sequences' in any DNA selection. This takes around 15 seconds. Matches of selected common sequences are identified in red. Common sequences can be hidden from view, leaving only those columns where at least one mutation has taken place in the originally-selected set of DNA, which - with this dataset - fit onto one screen.
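The algorithm used by the tool is not described here. As a minimal sketch of one possible reading, assuming the selected sequences are aligned and of equal length, a 'common sequence' can be treated as a maximal run of columns in which every selected sequence carries the same base. The class name `CommonRuns`, the method names and the `minLength` parameter are hypothetical and are not taken from the tool.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: NOT the tool's implementation.
final class CommonRuns {

    // Returns maximal runs of columns on which all sequences agree.
    // Each run is {start, end} with start inclusive and end exclusive (0-based).
    // Assumes all sequences are aligned and have the same length.
    static List<int[]> commonRuns(List<String> seqs, int minLength) {
        List<int[]> runs = new ArrayList<>();
        if (seqs.isEmpty()) {
            return runs;
        }
        int n = seqs.get(0).length();
        int runStart = -1; // -1 means "not currently inside a run"
        for (int col = 0; col <= n; col++) {
            // col == n acts as a sentinel that closes a run reaching the end
            boolean agree = col < n && allAgree(seqs, col);
            if (agree && runStart < 0) {
                runStart = col;
            } else if (!agree && runStart >= 0) {
                if (col - runStart >= minLength) {
                    runs.add(new int[] {runStart, col});
                }
                runStart = -1;
            }
        }
        return runs;
    }

    // True if every sequence has the same base in the given column.
    static boolean allAgree(List<String> seqs, int col) {
        char base = seqs.get(0).charAt(col);
        for (String s : seqs) {
            if (s.charAt(col) != base) {
                return false;
            }
        }
        return true;
    }
}
```

Sorting the returned runs by length would give the longest first. A match in a native strain could then be tested by comparing the same column range against any one of the selected sequences.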
Interpretation is left to the human. SequenceView supports the user in doing this by making effective use of alignment, sorting and interaction. SequenceView is responsive and supporting information is provided to the user quickly.
Link to video (18Mb)
Nigeria_B, because more of its DNA is in common with the current outbreaks than that of any other native strain. Figure 2 shows an example of a long sequence found (red) in Nigeria_B but not in other native strains. Zooming out (Figure 1) or scrolling (video) is quick and shows that this pattern occurs throughout.
Steps to screenshots: The screenshots were arrived at by (a) selecting all current outbreaks; (b) requesting the longest common sequences within these (the only automated function); and (c) selecting all the resulting common sequences (matches are then coloured red). This and scrolling through the sequences took a couple of minutes. Central Africa and Cameroon also share significant DNA with the outbreaks.
Identifying DNA sequences common to the outbreaks in the native strains and keeping DNA sequences aligned were key to answering this question.
Figure 1: Zoomed-out view showing native sequences (top), outbreaks (middle) and longest common sequences found in the outbreaks (bottom). Common sequences are identified in red and those over which the mouse is positioned are coloured yellow.
Figure 2: Section of the DNA sequences showing similarity with Nigeria_B.
The patient with strain 123, because it is more similar to Nicoli's strain (583). Only one base is different (at position 269), as opposed to three bases for the other strains (Figure 4).
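The comparison was made visually in the tool. Purely as an illustration, counting the bases that differ between two aligned strains can be sketched as below. The class and method names are hypothetical, the sequences are assumed to be aligned and of equal length, and columns are reported 0-based, which may not match the column numbering shown in the tool.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: NOT part of the tool.
final class StrainDiff {

    // Returns the 0-based columns at which two aligned sequences differ.
    static List<Integer> differingColumns(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("sequences must be aligned and of equal length");
        }
        List<Integer> cols = new ArrayList<>();
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                cols.add(i);
            }
        }
        return cols;
    }
}
```

The size of the returned list is the number of differing bases, so the strain with the smallest list is the most similar to the reference strain under this measure.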
Steps to screenshots: (a) the three strains were selected (numerically ordered); (b) common sequences were computed; (c) these were selected (matching sequences are highlighted in red); (d) non-selected sequences were hidden (Figure 3); and (e) common sequences were hidden, leaving just the columns in which at least one mutation had taken place (Figure 4).
Hiding common sequences left only those columns in which a mutation had occurred, which was key to answering the question.
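The effect of hiding common sequences can be sketched as a column filter: keep only the columns in which at least one selected sequence differs from the others. This is a minimal illustration, assuming aligned sequences of equal length; the names `MutationColumns`, `mutationColumns` and `project` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: NOT the tool's implementation.
final class MutationColumns {

    // Returns the 0-based columns in which the sequences do not all share one base.
    static List<Integer> mutationColumns(List<String> seqs) {
        List<Integer> cols = new ArrayList<>();
        if (seqs.isEmpty()) {
            return cols;
        }
        String first = seqs.get(0);
        for (int col = 0; col < first.length(); col++) {
            for (String s : seqs) {
                if (s.charAt(col) != first.charAt(col)) {
                    cols.add(col);
                    break; // one disagreement is enough to keep the column
                }
            }
        }
        return cols;
    }

    // Reduces each sequence to the given columns, preserving column order.
    static List<String> project(List<String> seqs, List<Integer> cols) {
        List<String> out = new ArrayList<>();
        for (String s : seqs) {
            StringBuilder sb = new StringBuilder();
            for (int col : cols) {
                sb.append(s.charAt(col));
            }
            out.add(sb.toString());
        }
        return out;
    }
}
```

Keeping the original column indices alongside the projected sequences preserves the link back to positions in the full alignment, in the way the column index tooltip does in the tool.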
Figure 3: (Zoomed-in) view showing all the native sequences (top), just 51, 123 and 583 of the outbreaks and the common sequences (bottom). Common sequences are red.
Figure 4: As figure 3, but with common sequences hidden, and with the base labels showing.
Outbreaks were sorted in order of severity; within each severity category, strains were sorted by the base in the column under the mouse cursor (compare Figures 5 and 6).
Assumptions: (a) a mutation affecting only one strain is not a 'top' mutation; (b) mutations should apply to different sequences; (c) all severe DNA sequences should be covered; (d) correlation between mutations does not imply that all are needed.
Steps to screenshots: (a) select all outbreaks; (b) find common sequences; (c) select these; (d) sort on severity; (e) additionally sort by bases in various columns.
The ability to sort on severity and on the bases in any column, the ability to hide common sequences (isolating mutations), and the column index tooltip were key to answering this.
Figure 5: Sorted by symptom severity and bases on column 269.
Figure 6: As above, but sorted by 223.
We used the same technique and assumption as above, but we sorted the strains on seriousness (sum of the disease characteristics; the grey column). This gave equal weight to these characteristics. Other characteristics could be explored by using the appropriate hue-based lightness (left of the strain) and by sorting by these. White-space is inserted between categories.
Within the sorted category, columns could be sorted on the bases of the column under which the mouse cursor was positioned, spatially consolidating bases of the same type. This helped identify correlation across strains and allowed the proportion of bases in a particular category to be estimated or counted more effectively.
Seriousness was highly correlated with severity. We could have chosen those mutations that affected just the top two or three seriousness categories, but some of these were answers to the previous question (i.e. strongly associated with severity). Also, since only three strains were in the most serious group, the sample sizes were quite small (see assumptions in previous answer). So instead we opted for complementary mutations (affecting different strains) which together covered most of the top half of serious cases.
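The two-level ordering described above can be sketched with a chained comparator: category first, then the base in a chosen column. This is an illustration only. The `Strain` class, its fields and the use of an integer score in which higher means more serious are assumptions, not the tool's data model.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: NOT the tool's implementation.
final class StrainSort {

    // Minimal stand-in for a strain: a name, a category score and its aligned bases.
    static final class Strain {
        final String name;
        final int score;    // assumed: higher means more serious
        final String bases; // aligned DNA sequence

        Strain(String name, int score, String bases) {
            this.name = name;
            this.score = score;
            this.bases = bases;
        }
    }

    // Sorts by score (highest first), then by the base in the given 0-based column.
    // The input list is left unchanged; ties keep their input order (stable sort).
    static List<Strain> sortByScoreThenBase(List<Strain> strains, int column) {
        List<Strain> sorted = new ArrayList<>(strains);
        sorted.sort(
            Comparator.comparingInt((Strain s) -> s.score).reversed()
                .thenComparingInt(s -> s.bases.charAt(column)));
        return sorted;
    }
}
```

Because strains with the same base end up adjacent within each category, the proportion of each base per category can be read off as the length of each consolidated block.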
We did not find evidence that more than one mutation on the same DNA sequence was necessary. Correlated mutations are candidates, but we decided that the evidence was circumstantial.
We can also explore other important characteristics of the disease, visually.
Mutations that change different characteristics of strains of the outbreak will require different planning responses by health authorities. Identifying key mutations as the pandemic continues is therefore essential for managing the response.
Metrics
Metrics could be computed for each mutation (e.g. number of mutations weighted by seriousness), and we would advocate implementing some to assist in data interpretation in future - they could, for example, be used to identify likely DNA candidates for a particular criterion or be used as a basis for sorting.
We did find, however, that interpreting the data visually was essential and any further metrics should support rather than replace this. For example, we easily confirmed that mutations were substitutions rather than insertions, by looking for offsets and not finding any. Sorting and alignment using the gaps between disease characteristic categories and horizontal line placement (right click; Figure 8) were particularly important. For example, the mutation at 955 does not look significant in isolation, but it is when considered in the context of the other mutations (they are complementary).
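No such metric was implemented. As a sketch of what 'number of mutations weighted by seriousness' might look like, the example below scores each column by summing the seriousness of every strain whose base differs from a reference sequence. The choice of reference, the integer seriousness scores and all names are assumptions made for illustration.

```java
import java.util.List;

// Hypothetical sketch: this metric was NOT implemented in the tool.
final class MutationMetric {

    // For each column, sums the seriousness of the strains whose base differs
    // from the reference base in that column. seriousness[i] belongs to seqs.get(i).
    // Assumes the reference and all sequences are aligned and of equal length.
    static int[] weightedMutationScore(String reference, List<String> seqs, int[] seriousness) {
        if (seqs.size() != seriousness.length) {
            throw new IllegalArgumentException("one seriousness score is needed per sequence");
        }
        int[] score = new int[reference.length()];
        for (int i = 0; i < seqs.size(); i++) {
            String s = seqs.get(i);
            for (int col = 0; col < reference.length(); col++) {
                if (s.charAt(col) != reference.charAt(col)) {
                    score[col] += seriousness[i];
                }
            }
        }
        return score;
    }
}
```

Columns with high scores would be candidates for closer visual inspection, or the scores could serve as a sort key for columns, in support of rather than in place of the visual analysis.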
Flexibility
Our approach of rapidly building a tool as part of the data exploration process using Processing enables us to incrementally add functionality where needed, including implementing new metrics or loading additional datasets. This allows the tool to grow in line with the depth of analysis required. SequenceView was built with this particular dataset in mind, but the functionality was designed to be as generic as possible and it should work with similar data in the same format.
SequenceView supports analysts by providing appropriate sorting, symbolism, data hiding, alignment, interaction and the function for automatically identifying the longest common sequences in any arbitrary selected set. This small set of basic functions supports the analyst in answering a wide set of questions about the DNA sequences.
Scalability
SequenceView was designed to accommodate this size of dataset in terms of computation time, graphical display and memory, and it is expected to work for datasets of similar size. In the current design all mutations fit on the same screen once the common sequences have been hidden. If there were many more mutations and/or longer DNA sequences, it might not be possible to fit them all on the screen at once. The screen size of bases could be reduced and scrolling could be used, but only if the requirement for scrolling was minimal, as scrolling would make visual analysis much more challenging. For datasets that are much larger, some redesign might be necessary: for example, allowing the user to hide any number of arbitrary columns not currently under consideration, or summarising/grouping mutations by disease characteristic category instead of showing all DNA sequences.
Programming
Building tools using Processing to answer such questions is efficient for us because of our experience. Processing is, however, designed for designers who are not necessarily programmers, to create 'sketches' like this quickly. We do not aim at this stage to build complete software tools, but rather to design and test visualisation and interaction ideas for visual analytics. However, we are building a reusable set of libraries that are publicly available. We (or somebody else) may implement those designs and ideas in a full piece of software later.
Figure 7: Sorted by overall seriousness.
Figure 8: Sorted by overall seriousness, with horizontal line showing that T base (mouse cursor) is complementary to 842 and 790 for this strain.
Figure 9: Sorted by complications, note likely key mutation (column 223).
Figure 10: Sorted by drug resistance, note likely key mutation (column 22).